svCapture: efficient and specific detection of very low frequency structural variant junctions by error-minimized capture sequencing

Abstract Error-corrected sequencing of genomic targets enriched by probe-based capture has become a standard approach for detecting single-nucleotide variants (SNVs) and small insertion/deletions (indels) present at very low variant allele frequencies. Less attention has been given to comparable strategies for rare structural variant (SV) junctions, where different error mechanisms must be addressed. Working from samples with known SV properties, we demonstrate that duplex sequencing (DuplexSeq), which demands confirmation of variants on both strands of a source DNA molecule, eliminates false SV junctions arising from chimeric PCR. DuplexSeq could not address frequent intermolecular ligation artifacts that arise during Y-adapter addition prior to strand denaturation without requiring multiple source molecules. In contrast, tagmentation libraries coupled with data filtering based on strand family size greatly reduced both artifact classes and enabled efficient and specific detection of single-molecule SV junctions. The throughput of SV capture sequencing (svCapture) and base-level accuracy of DuplexSeq provided detailed views of the microhomology profile and limited occurrence of de novo SNVs near the junctions of hundreds of newly created SVs, suggesting end joining as a possible formation mechanism. The open source svCapture pipeline enables rare SV detection as a routine addition to SNVs/indels in properly prepared capture sequencing libraries.


INTRODUCTION
Detecting sequence variants in next-generation sequencing (NGS) libraries demands that the signal from true variant DNA molecules rises above the background signal from pre-analytical and computational process errors. Historically, confidence was achieved by sequencing samples with moderate to high variant allele frequencies (VAFs) and demanding independent detection of variants in multiple source DNA molecules (1). As researchers have increasingly sought to apply NGS to samples with much lower VAFs, focus has shifted toward reducing the sequencing error baseline to improve signal-to-noise ratios. Applications where it is essential to detect sequence changes with very low VAFs include characterization of heterogeneity in tumor samples (2), studies of the nature and occurrence of mosaicism and somatic mutations in tissues (3)(4)(5), detection of genetic alterations induced by genotoxicants (6,7) and assessment of the off-target effects of gene editing methods (8,9).
Error minimization in NGS requires a detailed understanding of the mechanisms that create those errors (Figure 1A and Table 1). Error minimization strategies can be divided into two categories. We define 'error suppression' as preventing pre-analytical, i.e. wet lab, error mechanisms from impacting sequenced libraries so that false variants are never detected. PCR-free libraries and treatments to remove base damage are example error suppression strategies that eliminate or reduce PCR and base-copying artifacts, respectively (10). 'Error correction' refers to bioinformatics strategies, sometimes supported by input library designs, that allow errors present in sequenced libraries to be recognized and discarded.
Progress toward practical, error-minimized sequencing was advanced by the invention of duplex sequencing (Du-plexSeq) (11,12). DuplexSeq is an error correction strategy in which unique molecular identifiers (UMIs) facilitate the recognition of the ends of individual source DNA molecules and the strands of those molecules that gave rise to sequenced read pairs. Computationally enforcing the detection of called variants on both DNA strands of a source molecule counteracts the most important error mechanisms giving rise to SNVs and indels--pre-existing base damage and polymerase copying errors (10)--which are inherently strand-specific ( Figure 1A). DuplexSeq has been extended  in derivative approaches dependent on the same core logic, such as NanoSeq and SaferSeqS (13,14). Together, duplex error correction has enabled single-molecule SNV detection and is having a substantial impact on the above applications (15)(16)(17)(18). A limitation of most error correction strategies and studies is that they have not addressed the detection of rare SVs. Like SNVs and indels, SVs are a critical form of mutagenesis associated with specific disease states and clastogenic genotoxicants (19)(20)(21). SVs alter many more bases than SNVs and indels but their junctions are less frequent (22), increasing the need for stringency when assessing SV rates. Critically, the mechanisms that give rise to SV arti-facts in sequencing libraries are very different from those that give rise to SNVs and indels, necessitating different error minimization strategies ( Figure 1A and Table 1). SV errors are created by three main mechanisms, listed here in their order of occurrence. Ligation artifacts arise from the joining of two source molecules by DNA ligase during adapter addition in libraries fragmented through sonication or non-tagmentation-based enzymatic fragmentation. Chimeric PCR occurs when a primer begins extension from one source DNA molecule only to switch to a different template in a subsequent cycle. Alignment errors occur when the aligner software reports an improper genomic location for a read or read segment.
Our efforts to address the challenges of error-minimized SV detection were motivated by our goal of defining the mechanisms of SV formation at CFSs (23)(24)(25)(26)(27). CFSs are genomic loci that are especially sensitive to replication stress and SV formation, representing genomic instability 'hotspots' (28)(29)(30). Effective study of the molecular events occurring at CFSs demands that de novo SVs be detected as they form, when only a single source molecule will carry a given junction. Similar needs for ultra-rare SV detection exist in genotoxicant monitoring, characterization of lowlevel mosaicism in tissues and other applications. Here, we carefully analyze the frequency and nature of SV artifact classes in target capture libraries made from DNAs bearing SVs characterized a priori or increased population burdens of rare induced SVs. DuplexSeq effectively addressed PCR artifact mechanisms, whereas tagmentation-based libraries coupled with bioinformatics filters addressed both PCR and ligation-mediated artifacts to yield signal-to-noise ratios suitable for monitoring rates of rare SV junction formation. The structures of hundreds of high-confidence and accurately sequenced SV junctions suggest that they are likely to arise by end joining.

Human fibroblast cell line, clone mixtures and stressed populations
Samples analyzed in this study are derived from HF1, a TERT-immortalized primary human foreskin fibroblast cell line used in our prior SV studies (27,30). HF1 was propagated in Dulbecco's modified Eagle medium supplemented with 13% fetal bovine serum, 4 mM L-glutamine and 1× penicillin-streptomycin. Initial experiments used mixtures of previously described HF1 cell clones bearing known SVs (27,30). We recovered the clones from our frozen stocks, expanded them and extracted DNA using the Qiagen Blood and Cell Culture Mini Kit (#13323). DNA was quantified using Qubit (Thermo Fisher), diluted to 30 ng/l and then mixed with DNAs from other clones or parental HF1 cells at known ratios. Further experiments used HF1 cell populations treated to induce de novo SVs. An exponentially dividing culture of HF1 cells was split and left untreated or exposed to either 0.2 or 0.6 M aphidicolin (APH) in dimethyl sulfoxide for 72 h. Cells were allowed to recover without APH to complete SV junction formation for a further period of 24 h. Cells were then split, seeded at 2 × 10 5 cells and expanded to confluence, after which DNA was extracted for sequencing.

Target capture probe design and nomenclature
Capture probes were targeted to large genes according to the design parameters in Figure 1B and Supplementary Table S1. Final probes were designed and synthesized by Twist Biosciences using their proprietary algorithms to ensure specificity and quality and used as provided by the vendor. When denoting the correspondence of SV endpoints to target regions, 'T' refers to bases within a capture target, 'A' refers to bases in the 250 kb regions adjacent to each capture target and '-' refers to all other genome bases. Uppercase letter pairs, e.g. 'TA', identify SVs with both ends in the same capture target, whereas lowercase letters, e.g. 'tt', identify all other unexpected, often artifactual, SVs.

DuplexSeq libraries
DuplexSeq libraries were prepared by TwinStrand Biosciences Inc. (Seattle, WA) according to published procedures (12,17,18). Briefly, genomic DNA was fragmented by ultrasonication to a peak size of 300 bp, followed by end repair, A-tailing and DuplexSeq™ adapter ligation. After indexing PCR, libraries were subjected to one round of overnight hybrid capture with biotinylated probes. Targets were purified with streptavidin magnetic beads, followed by washes and a final round of PCR prior to pooling and sequencing.

Tagmentation sequencing libraries
Bead-based tagmentation libraries were prepared using the Illumina DNA Prep with enrichment, formerly called the Nextera Flex, kit (Illumina #20025523) by the University of Michigan Advanced Genomics Core. Tagmented libraries were prepared using an input of 300 ng of genomic DNA, IDT for Illumina unique dual barcodes (Illumina #20027213) and library PCR amplification of nine cycles. Libraries were quantified using Qubit and quality was checked using an Agilent TapeStation. Capture was then performed by pooling 500 ng of each library and hybridizing with 4 l Twist Biosciences probes and 6 l PCR grade water. Enrichment was completed with 12 cycles of PCR amplification.

DNA sequencing
All sequencing reads were obtained in the 2 × 151 format using Illumina NovaSeq 6000 by the University of Michigan Advanced Genomics Core. Barcoded samples were pooled with each other and other users' samples and subjected to a sequencing depth calculated to yield a projected coverage of ∼2000-fold in the capture target regions based on prior experience. Library read counts, source DNA molecule yields and enrichment values can be found in Supplementary Table S2.

svCapture data analysis tools and versions
Data analysis and visualization, including plots in figures, were accomplished using the svCapture pipeline and app in v1.0.0 of the svx-mdi-tools suite (repository: https://github.com/wilsontelab/svx-mdi-tools; documentation: https://wilsontelab.github.io/svx-mdi-tools), supported by genomic modules in the main branch of the genomex-mdi-tools suite (https://github.com/ wilsontelab/genomex-mdi-tools), as implemented using the frameworks provided by the Michigan Data Interface (https://github.com/MiDataInt). Here, 'pipeline' refers to high-performance computing actions executed using various programs and custom shell, Perl and R scripts on the University of Michigan Great Lakes cluster, whereas 'app' refers to R Shiny graphical tools executed using R 4.0.3 on a Windows 10 desktop computer. The job configuration scripts used to launch the pipeline and associated log files that list all options and program versions are available at https://github.com/wilsontelab/ publications/tree/main/svCapture-2022. A simpler demo is available at https://wilsontelab.github.io/svx-mdi-tools/ docs/svCapture/toy-example.html.

svCapture pipeline
The svCapture pipeline 'align' action converts putative Du-plexSeq UMI sequences within read pairs to a UMI index from 1 to 96, when applicable, allowing up to one mismatch between the observed and expected UMIs. Reads with nonmatching UMIs are discarded. The fastp program (31) merges overlapping read pairs and enforces read quality filtering. The bwa mem program (32,33) aligns read pairs to the GRCh38/hg38 or other appropriate reference genome obtained from Illumina iGenomes with results stored in name-sorted CRAM files.
The 'collate' action performs source DNA molecule analysis. Two outermost molecule positions are selected from aligned segments passing a map quality filter (MAPQ > 20) to act as keys for comparing read pairs to each other. Read pairs that share the same UMI indices, when applicable, and the same endpoints and outer clip lengths within a 1 bp allowance are grouped and referred to as a source DNA molecule. For libraries using ligated Y adapters, such as Du-plexSeq, read pairs with opposite strand orientations are grouped as part of the same source molecule. In contrast, different read pair orientations sharing the same molecular endpoints must arise from independent Tn5 cleavage events and are considered different source molecules in tagmentation libraries. Read pair strand family sizes are counted, and a two-step consensus is constructed, first on each strand separately and then, when available, between the two strands. We used a fractional threshold of 0.667 shared bases in at most 11 downsampled read pairs for consensus calling; sequence positions failing this criterion are masked as N bases. svCapture includes steps for alignmentguided merging of smaller read pair overlaps not handled by fastp. Consensus sequences are finally realigned to the reference genome.
The 'extract' action examines final read alignments for patterns that predict an underlying SV junction (Supplementary Figure S1). Source DNA molecules are described as a series of nodes at the endpoints of all alignment segments, each with a chromosome, position and direction (left or right) that the molecule proceeds from that point with respect to the reference genome. Source DNA molecule replicates are defined by their shared outer nodes. Putative SV junction molecules are handled one at a time and characterized by their inner nodes. Node pairs defining novel junctions are sorted to reflect the top strand of the reference genome in non-inverted segments in the order found in the source molecules. Associated alignment sequences and CIGAR strings are carried forward after reverse complementing the alignments within (but not flanking) inversion segments as defined relative to the conjoined reference genome, such that all sequences now represent contiguous, reconstructed SV junctions. Nodes from all source molecules are compared to identify SV molecules that share outer endpoints with properly paired molecules, which provides evidence that those SV molecules are PCR artifacts derived by recombining other molecules.
The 'find' action searches extracted node tables for sets of molecules with distinct outer node signatures but the same inner node signatures, i.e. independent source molecules crossing the same SV junction (Supplementary Figure S1). Data are sorted by left nodes to allow molecules to be broken into sets where nodes are separated by less than the maximum proper insert size in the library as declared by bwa mem. Molecules within each set are re-sorted by the right node and the process is repeated to yield an evidence set supporting each putative SV. Additional steps include further purging of likely duplicate source molecules with outer endpoints closer than 5 bp in total, the recovery of source DNA molecules that were clipped at the junction on their outer edges by at least 5 bases and the reconstruction of junction sequences lacking split read evidence by joining the inner clipped nodes of molecules where junctions fell in read gaps. A single source DNA molecule with the most central junction is chosen as a reference for parsing final junction descriptions, including SV type, correspondence to capture targets, and the nature of microhomologies and de novo insertions. All steps can be performed on a single sample or on a set of input samples analyzed together to ensure sensitive matching of even single source molecules between samples.
Finally, the pipeline 'genotype' action compares the base content of SV calls to the non-SV molecules of the source individual. Proper molecules lacking SV junctions from CRAM files are analyzed by bcftools mpileup, call and norm (34) to establish a VCF file with a biallelic genotype at every sufficiently covered genome position within and adjacent to each capture target region. The resulting unphased haplotypes are compared to consensus SV junction sequences obtained from all aligned segments over all source molecules in each SV set. SV consensus bases absent from either the reference genome or haplotype variant lists are marked as presumptive SNVs or indels created during SV junction formation.

svCapture app and related downstream analysis
The svCapture app supports interactive visualization of SVs and enforces further adjustable SV filters, such as demanding duplex molecules and setting minimum strand family sizes and minimum and maximum numbers of supporting molecules or co-analyzed samples that share the same SV. For users wishing to examine SVs in other tools, a VCF file is also created where filtered SVs are encoded as breakends (SVTYPE = BND).
Matching of svCapture SV calls to prior data was achieved by first computationally comparing SV endpoints within a 25 kb allowance, given the imprecision of microarray calls. All potentially matching SVs were then manually examined for best matches, most of which were unambiguous as discussed below. VAFs were calculated for each sequenced SV as the count of total supporting molecules, including outer clips, divided by the average target region coverage. Expected VAFs were one half the clone frequency because all SVs were known to be heterozygous.
To assess the non-randomness of the positions of de novo SNVs and indels, the number of variant-containing SV consensuses crossing informative positions at each distance from the junction is counted to establish the expected distance distribution of randomly selected bases from those SVs, given the distribution of sample fragment sizes. The observed number of variants is randomly selected from this weighted distribution over 10 000 iterations. The P-value is the fraction of iterations whose median distance is less than the median of the observed variant distances. A significant result indicates that SNVs and indels are more likely to occur near the SV junction (Mann-Whitney and other tests are also available).

Design of cell mixtures and capture targets for svCapture validation
To help optimize capture sequencing for rare SV detection, we drew on established human HF1 fibroblast cell clones bearing known replication stress-induced SVs as established by genomic microarrays (27,30). We constructed a first set of five sample mixtures that were prepared, sequenced and analyzed together that bore varying levels of different HF1 SV clones from 1% to 30% (Supplementary Tables S3 and  S4). The composition of most mixtures was known to the data analyst, whereas the last sample was initially blinded and included unknown clone percentages and a subset of clones absent from other samples. A second set of five samples contained populations of HF1 cells treated with low doses of APH, a commonly used agent of replication stress that is highly effective at inducing de novo SVs (30,35). To further promote predictable SV patterns, we selected three genes from published HF1 SV data (27,30) as known, highvalue hotspots of replication stress-induced SV junction formation: NEGR1 (0.89 Mb, chr1), MAGI2 (1.44 Mb, chr7) and PRKG1 (1.31 Mb, chr10). Capture probes targeted a territory of 250 kb around the center of each hotspot gene ( Figure 1B and Supplementary Table S1), which bears the highest burden of induced SVs (30). This represented a substantial increase in target span as compared to typical prior DuplexSeq libraries (15)(16)(17)(18). Most SVs induced in these large genes are known to be deletions (30), which provided a further specificity test for method validation.
DNAs from the above sample sets were subjected to target capture and, initially, sequencing by DuplexSeq. A subset of the same DNAs was later sequenced again using tagmentation. We wrote a data analysis pipeline in the Michigan Data Interface because prior pipelines were optimized to address duplex consensus making or SV detection but not both (12,17,36). Like prior pipelines, svCapture groups read pairs by source molecule and strand, creates singlestrand and, when possible, duplex consensus sequences, and realigns those consensuses to the reference genome. Additional features include read pair merging to maximize alignment specificity, persistence of all source molecules and coordination of strand tracking with SV junction calling.

DuplexSeq error correction eliminates chimeric PCR artifacts
Our entry rationale was that DuplexSeq would correct for chimeric PCR artifacts in capture libraries, which our early efforts suggested was the major confounder to SV detection. PCR is the last step in most capture protocols to create sufficient DNA mass for sequencing. It is problematic because it can promote chimeric PCR that is highly likely to fuse two molecules enriched by capture, mimicking the SV junctions we wish to detect. Chimeric PCR occurs after source DNA molecules are melted ( Figure 1A) and should be amenable to duplex correction as it is highly unlikely that the same chimeric PCR event would occur twice independently, once in each template strand orientation. Because we know the properties of our source samples and genomic loci well (27,30), we can assert that apparent translocations between capture target regions are nearly always artifacts. True SVs will be mainly deletions with one captured endpoint within a target (T) region and the second endpoint within or adjacent (A) to the same region. Second endpoints at large distances from the target region are again mostly artifacts.
We used plot devices described in Supplementary (Table 1). Consistently, requiring that at least one duplex molecule supported an SV call eliminated this error class ( Figure 2B and Supplementary Figure S3B). Notably, duplex filtering was not required to correct for false chimeric PCR SV calls. Examining SV strand family sizes, i.e. the number of replicate read pairs from each source molecule strand (Supplementary Figure S1A), showed that intertarget translocation SVs of the 'tt' class rarely showed more than a single supporting read pair (Supplementary Figure S4). Moreover, presumptive chimeric PCR events often shared one or both outer endpoints with proper molecules from the same library, the likely templates for late cycle chimeric PCR (Supplementary Figure S5A). Enforcing strand family size ( Figure 2C and Supplementary Figure S3C) and/or shared endpoint ( Supplementary Figure S5B) filters proved effective for correcting chimeric PCR artifacts without sacrificing >50% of source molecules that failed duplex detection (Supplementary Table S2).

DuplexSeq error correction alone cannot eliminate intermolecular ligation artifacts
Examining Figure 2C and Supplementary Figure S5B revealed a residual class of presumptive SV artifact junctions that did not respect the capture target boundaries, e.g. translocation junctions extending throughout the target (T) and adjacent (A) regions. Many of those remaining artifact junctions fell in the unsequenced gaps between read pairs (hence, they could not be plotted in Figure 3C), consistent with them being especially large molecules with central junctions, the expected pattern for intermolecular ligation artifacts. Consistently, this class of artifact junctions extended at the same baseline level throughout the untargeted genome ( Figure 4A) and had a distinct junction profile characterized by blunt joints or single-base insertions or deletions ( Figure 4B), as predicted for intermolecular ligation artifacts (1041 of 1293, 81%, of the one-base insertions were A or T, suggestive of A-tailing). Because such artifacts arise prior to strand melting ( Figure 1A), duplex filters alone are not expected to correct them (Table 1) and indeed they did not (Figures 2B and 4C, and Supplementary Figure S3B). Even with duplex filtering, tens of thousands of target-to-genome SV calls with predominantly blunt end junctions persisted that would confound many applications. Both ligation and chimeric PCR artifacts could be corrected by demanding multiple independent source molecules (Figure 2D), but that restriction prevents the detection of many rare, true SVs (see below).

Tagmentation capture libraries largely, but incompletely, suppress ligation artifacts
Because there was no practical method for correcting ligation artifacts, we turned to error suppression to achieve this important goal. Because tagmentation libraries use the Tn5 transposase to connect adapters to genomic DNA molecules, not DNA ligase, ligation artifacts are not expected (Table 1). Comparing tagmentation to DuplexSeq libraries in the absence of any filters first revealed a new class of frequent, small inversions with large microhomologies that appeared uniquely in the tagmentation libraries (compare Figure 3A and E). Others have described that such errors arise by intramolecular foldback synthesis that is especially prevalent in tagmentation libraries, most likely during the extension of tagmented ends (37,38). Somewhat surprisingly, chimeric PCR artifact rates were lower in  Table S2). Family size filters were again highly effective at removing intertarget chimeric PCR artifacts as seen for DuplexSeq ( Figure 2G and G). Most importantly, the rates of intermolecular ligation evident as 't-' class molecules were markedly reduced in tagmentation libraries ( Figure 4D). The few such molecules that did occur shared a junction microhomology profile characterized by predominantly blunt ends ( Figure 4E). This unexpected result suggests a low degree of ligation activity in tagmentation libraries (see the 'Discussion' section).

svCapture faithfully reports SV junction locations
To ensure that error correction and suppression were not eliminating true SV molecules, we examined our samples with mixed SV-containing clones, requiring three read pairs to call an SV. Figure 5A and B shows a clear match to the SVs known to exist from prior microarray analysis down to a 1% allele frequency (compare detected SVs to the black circles denoting expected SVs). One expected 25 kb duplication in the NEGR1 gene did not have an explicitly matching call, but we observed a nearby 4.5 kb duplication (red circle in Figure 5B) with unambiguous molecular evidence (Figure 5C and Supplementary Figure S6A). We interpret this as an error in microarray assessment of this SV, a method with much less precision in endpoint assignments, especially for duplications. We similarly observed apparently undetected SVs when we examined our initially blinded sample ( Figure 5D). While we cannot rule out that some SVs were missed by sequencing, we again noted the presence of other strongly evidenced SV junctions leading us to question the precision of the microarray calls. For example, HF1 clone Scr1-14A had two ∼75 kb deletion calls in MAGI2 by microarray (Supplementary Table S4), whereas sequencing established the existence of a single, overlapping, unambiguous 149 kb deletion in the pool (red circle in Figure 5D; Supplementary Figure S6B). Other unexpected but strongly evidenced SVs were also noted at low VAFs, such as the deletion highlighted by a red circle in Figure 5A and depicted in Supplementary Figure S6C with six supporting molecules.

svCapture reports SV allele frequencies with modest accuracy
To explore the quantitative efficiency and accuracy of svCapture, we plotted the observed versus expected SV VAFs. While many SVs in the 10% mixture were recovered at approximately the expected VAF of 5%, those of the TA class showed lower VAFs ( Figure 5E). SVs with only one end in a capture target region could thus be detected but at a lower efficiency, presumably due to reduced binding to capture probes. The 10% and 1% clone mixtures showed only a modest correlation (R = 0.74), in a pattern that initially suggested a reduced detection efficiency at lower VAF. However, plotting the observed versus expected VAF for mixtures with variable content showed the opposite trend, with underrepresentation of the SVs with higher expected VAF ( Figure 5F). SV recovery efficiency is thus variable and a function of several factors. Importantly, in this study one of those factors is the uncertain accuracy with which the clone mixtures were prepared.

svCapture permits SV assessments at very low allele frequencies
While reliable detection of SV junctions at 1% VAF is valuable, it falls short of the needs of many applications. Therefore, we examined our APH-treated cell populations expected to carry a high burden of individually rare SVs across different cells. Figure 6A and B shows the ready detection of increased SV junctions by either sonication-based or tagmentation-based svCapture following APH treatment of HF1 fibroblasts. In aggregate, these induced SVs were very often single-molecule detections (305 of 415, 73%), consistent with the expectations of de novo SV formation in a large cell population without clonal restriction. Induced SVs were strongly biased toward deletions (359 of 415, 87%), as we have previously observed at large gene loci (30), which, together with their induction by APH, provides excellent support for their existence in the source DNA. We further observed an apparent dose responsiveness of SV induction by APH and that tagmentation appeared more efficient than DuplexSeq at rare SV detection ( Figure 6C), although more data are needed to draw firm conclusions on these points. The basal rate of SV formation in untreated cells is expected to include some persistent artifacts noted above (e.g. Figure 4D), but the fact that they are also mostly deletions in the middle of large transcribed genes suggests that many correspond to the basal copy number variant rate we have documented with microarrays (30).

DuplexSeq reveals a low rate of de novo insertions, SNVs and indels near SV junctions
We finally examined properties of SV junctions that are useful for inferring the associated DNA repair mechanisms. The hundreds of SVs induced by APH showed a junction profile distinct from either artifact class characterized by short, mainly 2-4 bp, microhomologies and a tail of junctions with de novo sequence insertions of up to 25 bp (Figures 4B and E and 6). These patterns provide further validation that these are valid SV calls distinct from the SV artifact classes. The high accuracy of DuplexSeq, coupled with abundant read depth for creating unphased haplotypes of the source individual, further allowed us to examine the rate of de novo SNVs and indels in the alignments flanking the SV junctions (see Figure 7A and Supplementary Figure S7 for an example). We measured a net rate of <1 in 1000 variant base positions in duplex SV calls that could not be accounted for by constitutional SNPs and indels in the HF1 genome ( Figure 7B). The positions of the few variants observed in subclonal SVs from mixed-clone samples ap-peared random, whereas there was a small but statistically significant bias (P ≤ 0.038) for the more numerous APHinduced SVs for being located closer to the junction than expected by random chance ( Figure 7C).

DISCUSSION
We describe svCapture, which extends error-minimized sequencing to SVs to enable a variety of downstream applications in molecular genetics, cancer biology, genetic toxicology and other fields. A first conclusion, consistent with prior work with whole genome sequencing, is that preanalytical process errors, including intermolecular ligation and chimeric PCR, can be corrected by demanding that multiple independent source molecules match an SV junction (1,20). Demanding even two source molecules provides strong correction since all relevant wet lab process errors occur after fragmentation such that distinct molecule outer endpoints establish that an SV junction was present multiple times in the source DNA. The processes for determining source molecule independence and matching SV junctions must be meticulous but are well described (20,36), and SV junctions provide abundant information for accurately grouping source molecules. The challenge we addressed was extending SV detection to capture NGS where sequencing multiple independent source molecules is impractical or impossible due to low VAFs in input samples. Relevant applications include realtime monitoring of SV junction formation prior to replication in mechanistic studies, where only one source molecule could possibly carry the junction, or SVs arising just prior to the last mitotic division in neuronal or germ cell development (39)(40)(41) or during short-duration genetic toxicology studies (6,7,42). Even when SV junctions are subclonal, having been replicated during prior cell divisions, they are effectively unique at very low VAFs with respect to randomly sampled NGS libraries unless prohibitively deep sequencing is performed. At an average on-target coverage of 2400 in bulk samples, we readily detected known SVs at 1% VAF with multiple molecules ( Figure 5 and Supplementary Figure S6) but SVs would often yield single-molecule detection at 0.1% VAF. Moreover, we observed only a modest ability of svCapture to accurately score VAFs. Multiple factors likely contribute to variable detection efficiency, including the number of SV ends captured, GC content, probe coverage, competition during capture and PCR, and others.
Of the three SV error classes ( Figure 1A), we were the least concerned about alignment errors. Modern aligners, such as the bwa mem utility used in svCapture (32,33), provide reliable information regarding alignment confidence in the form of MAPQ scores. Enforcing MAPQ filters reduces the detection of true SV junctions in repetitive DNA, but that lack of sensitivity is inherent to short-read libraries. Long-read technologies remain essential for characterizing SVs in repetitive regions (43)(44)(45). The greatest residual problem we have found regarding read alignment is for the subset of junctions containing short tandem repeats, where sequence changes arising during library preparation can cause aligners to give falsely high MAPQ scores.
In contrast, the remaining error classes--intermolecular ligation and chimeric PCR--readily yield false SV junctions with high MAPQ scores. Other methods are required to minimize their impact. Our goal was to improve SV signal-to-noise ratios by reducing artifact background, since true signals are obligatorily low in our target applications. Several computational filters could effectively correct for chimeric PCR errors, including duplex strand detection in DuplexSeq and high read pair and low shared endpoint counts in all library types (Figures 2 and 3, and Supplementary Figures S3 and S5).
Thus, ligation artifacts proved of greatest concern. Du-plexSeq logic could not correct for intermolecular ligation because the artifact occurs prior to strand melting, and no other computational filter proved capable of doing so other than source molecule counts. However, libraries that use tagmentation to covalently bond adapter sequences to genomic DNA were highly effective at suppressing ligation artifacts from ever appearing in sequenced DNAs ( Figure  4). Tagmented DNAs cannot typically support duplex error correction because they lack dual strand identification, but, as noted above, chimeric PCR can be corrected in other ways such that highly sensitive and specific detection of even single-molecule SVs was achieved by tagmentation-based svCapture (Figures 2, 3, 5 and 6). Like others, we noted a class of small inversion artifacts in tagmentation libraries (37,38), but this does not impair the accurate calling of inversions larger than ∼1 kb or deletions, duplications or translocations.
Surprisingly, some apparent intermolecular ligation was observed in tagmentation libraries, even if greatly reduced relative to DuplexSeq. This inference is drawn because the artifacts in question had second SV ends spread evenly throughout the genome and the same predominantly blunt end junction structure as seen with DuplexSeq ( Figure 4). We do not know whether the relevant phosphodiester bond formation is catalyzed by Tn5 itself or by some amount of ligase in the proprietary Illumina DNA Prep kits.
Importantly, any wet lab approach that prevents intermolecular ligation would suppress that artifact mechanism. The ideal would be a library prep that is both PCR-free and ligation-free. PCR-free tagmentation libraries are now offered by vendors but with unspecified details as to how sequencer-ready libraries are created following Tn5 action. Quispe-Tintaya et al. reported structural variant search (SVS) (42,46) for use on the Ion Torrent platform. Although SVS uses ligation to add second adapter strands, it does so after homopolymer tailing that reduces intermolecular ligation. Xing et al. reported META-CS in which pools of barcodes on Tn5 mosaic ends support duplex logic in tagmentation libraries (47). These innovations in SV error suppression were applied to whole genome sequencing, which requires a commitment to higher sequencing costs as compared to target capture. However, META-CS could be adapted to capture sequencing where it would be expected to provide best sensitivity and accuracy for rare SV junction detection at the expense of a more complex library preparation, as compared to the commercial kits shown here to be suitable for detecting rare induced SV junctions above a baseline.
Data reported here represent the largest set of de novo SV junctions yet acquired in a controlled, prospective, experimental paradigm. They established in high-accuracy calls two features of induced SV junctions, made possible in part by the application of DuplexSeq logic. First, the APHinduced junction sequence profile is characterized by short microhomologies and small de novo insertions (Figures 4  and 6) that are a good match to DNA polymerase thetamediated end joining (TMEJ) (48,49). We also observed a low but measurable rate of de novo SNV acquisition near SV junctions that showed a biased localization in the first 50 bases on either side of the junction (Figure 7), consistent with a process confined to end sequences. This combination of features suggests that TMEJ or another end joining mechanism might create many SV junctions, with TMEJ favored since we previously showed that a canonical non-homologous end joining factor, Xrcc4, is not required for SV junction formation (29). More data are needed, however, since both microhomologous junctions and errorprone copying have also been invoked for replication-based SV formation models such as microhomology-mediated break-induced replication (50,51).
In summary, sensitive and specific detection of very rare SV junctions can be achieved in short-read capture libraries with sufficient attention to suppressing intermolecular ligation events by appropriate choices of library preparation protocols and correcting chimeric PCR artifacts by filters in data analysis pipelines. The svCapture procedure has a high specificity and efficiency but modest quantitative accuracy for assessing VAFs. There are unavoidable limitations associated with short reads that include the inability to detect SV junctions in repetitive DNA or those mediated by recombination between longer regions of homology. Moreover, even in our optimized tagmentation approach the background of target-to-genome SV artifacts is not zero. Nevertheless, many important classes of non-homologous SVs can be readily detected above very low background signals enabling intersample comparisons of the formation rates of even single-molecule SVs. The method is well suited to studies of SV formation mechanisms, somatic SVs and SVinducing genotoxicants.

DATA AVAILABILITY
All codes used in data analysis and figure generation are available for download as part of the svx-mdi-tools repository at https://doi.org/10.5281/zenodo.7871677 and https: //github.com/wilsontelab/svx-mdi-tools, via its svCapture pipeline and app. HF1 svCapture raw read data, in FASTQ format, and all called SVs, in VCF format, are available from the Database of Genotypes and Phenotypes (dbGaP) under accession phs003121.v1.p1.